fix(paper_reviewer): include real paper body + bibliography in prompt; normalize score #197

Merged: jeremymanning merged 1 commit into main from fix/paper-reviewer-prompt-truncation on May 17, 2026

@jeremymanning
Member

Summary

  • Reviewers on arXiv-intake papers never saw the paper content. _concat_tex sorted files alphabetically under a 60KB budget, so the prompt always contained extra_pkgs.tex (~3KB) followed by "(truncated; remaining files: 2)", and main.tex (~250KB) was never inlined. Reviewers correctly issued major_revision_writing verdicts citing "no LaTeX source", but they were judging the truncation, not the paper.
  • state/citations/<PROJ>.yaml is never populated for intake projects, so the bibliography section was always "(no citations recorded)" even when paper/source/ref.bib had 100+ entries right there.
  • ~1/13 specialists per project failed pydantic validation because the LLM picked an accept verdict but wrote score: 0.0 (the validator requires score: 0.5 for accept).
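
To make the first failure concrete, here is a minimal sketch of how an alphabetical, whole-file concatenation under a 60KB budget drops main.tex. This is a hypothetical reconstruction, not the project's code: the function name `concat_alphabetical`, the stop-at-first-misfit behavior, the `supp.tex` file, and counting characters rather than bytes are all assumptions made for illustration.

```python
def concat_alphabetical(files: dict[str, str], budget: int = 60_000) -> str:
    """Hypothetical stand-in for the old behavior: whole files, alphabetical order."""
    parts, used = [], 0
    names = sorted(files)
    for i, name in enumerate(names):
        body = files[name]
        if used + len(body) > budget:
            # A file that does not fit is never inlined; only a footer is emitted.
            parts.append(f"(truncated; remaining files: {len(names) - i})")
            break
        parts.append(f"% --- {name} ---\n{body}")
        used += len(body)
    return "\n".join(parts)


# Sizes mirror the ones quoted above: ~3KB of packages, ~250KB of paper body.
files = {
    "main.tex": "\\documentclass{article}\n" + "x" * 250_000,
    "extra_pkgs.tex": "\\usepackage{amsmath}\n" * 150,
    "supp.tex": "y" * 10_000,  # hypothetical third file
}
prompt = concat_alphabetical(files)
```

Because `extra_pkgs.tex` sorts before `main.tex`, the small file is inlined first and the large entry point is the first file that fails the budget check, so the prompt ends with the footer and contains no paper body.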

Fixes

  1. _concat_tex rewritten: promote the entry-point file (\documentclass) to the front; truncate IT if needed instead of skipping; default budget bumped to 180KB.
  2. _summarize_bibfile fallback: when state/citations is empty, inline paper/source/*.bib (capped 30KB) so the reviewer can judge the reference set.
  3. handle_response normalizes score from verdict before validation — losing a substantive review to a numeric-formatting slip wasted Dartmouth calls.
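
Fixes 1 and 2 can be sketched roughly as follows. This is an illustrative approximation under stated assumptions, not the merged implementation: the names `concat_tex`, `summarize_bibfiles`, and `bibliography_section`, the in-memory `files` dict, the footer strings, and the use of character counts for the budget are all hypothetical.

```python
import tempfile
from pathlib import Path


def concat_tex(files: dict[str, str], budget: int = 180_000) -> str:
    """Sketch of fix 1: entry point first, truncated (not skipped) if too large."""
    # Files containing \documentclass sort ahead of the rest; ties stay alphabetical.
    names = sorted(files, key=lambda n: ("\\documentclass" not in files[n], n))
    parts, remaining = [], budget
    for i, name in enumerate(names):
        body = files[name]
        is_entry = i == 0 and "\\documentclass" in body
        if len(body) > remaining:
            if not is_entry:
                parts.append(f"(truncated; remaining files: {len(names) - i})")
                break
            # Keep the head of the entry point instead of dropping it.
            body = body[:remaining] + "\n(entry point truncated to fit budget)"
        parts.append(f"% --- {name} ---\n{body}")
        remaining -= min(len(body), remaining)
    return "\n".join(parts)


def summarize_bibfiles(source_dir: Path, cap: int = 30_000) -> str:
    """Sketch of fix 2: inline the *.bib files found in source_dir, capped."""
    chunks = [
        p.read_text(encoding="utf-8", errors="replace")
        for p in sorted(source_dir.glob("*.bib"))
    ]
    text = "\n".join(chunks)
    return text[:cap] if text else "(no citations recorded)"


def bibliography_section(recorded: list[str], source_dir: Path) -> str:
    # Prefer recorded citations; fall back to the raw .bib files when there are none.
    return "\n".join(recorded) if recorded else summarize_bibfiles(source_dir)


files = {
    "main.tex": "\\documentclass{article}\n" + "x" * 250_000,
    "extra_pkgs.tex": "\\usepackage{amsmath}\n" * 150,
    "supp.tex": "y" * 10_000,  # hypothetical third file
}
prompt = concat_tex(files)

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "ref.bib").write_text("@article{a, title={A}}\n" * 2000, encoding="utf-8")
    bib = bibliography_section([], src)
```

In this sketch the prompt now opens with the head of `main.tex`, and the bibliography section carries the first 30,000 characters of `ref.bib` rather than the "(no citations recorded)" placeholder.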

Verification

Manually re-ran 8 previously-failing arxiv-intake projects (PROJ-564, 565, 566, 568, 570, 571, 576, 578). All 8 now produce substantive 13-specialist reviews instead of crashing or boilerplate:

| Project | Verdict | Highlight |
| --- | --- | --- |
| PROJ-564 | accept | "Global Skip Connections... resolves tripartite trade-off between compression, fidelity, diffusability" |
| PROJ-565 | accept | "Unified benchmark suite with 2,388 instances and 2,251 preference pairs" |
| PROJ-566 | accept | "Strong systems contribution with validated scaling axes" |
| PROJ-568 | minor_revision | substantive |
| PROJ-570 | minor_revision | "source file contamination detected" |
| PROJ-571 | minor_revision | "missing hyperparameter value for $\beta_k$"; references Eq. 12 and Algorithm 1 |
| PROJ-576 | accept | "Strong efficiency / quality trade-off for minute-scale generation" |
| PROJ-578 | major_revision_science | correctly flagged "GPT-5.4 / Claude Sonnet 4.5 / Gemini-3.1-Pro" as unverifiable model names |

Reviews now reference Algorithms, Tables, Figures, and hyperparameters by name. The LLM is reading and reasoning about the actual paper, not the package preamble.

Test plan

  • 17 unit tests in test_paper_reviewer_arxiv_intake.py pass (8 new tests for truncation + bib + score-normalization)
  • Full unit suite (395 tests) passes
  • Verified manually against all 8 previously-failing arxiv-intake projects on disk
  • Next paper-review cron tick (every 16h) will confirm the fix sticks under real CI
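
The score-normalization behavior covered by the new tests can be illustrated with a small sketch. It assumes a strict verdict-to-score check and uses a plain dataclass as a stand-in for the pydantic model; `Review`, `normalize_score`, and `SCORE_FOR_VERDICT` are hypothetical names. Only the accept → 0.5 pairing comes from this PR; every other value in the table is a placeholder.

```python
from dataclasses import dataclass

# Only "accept" -> 0.5 is stated in the PR; the remaining values are placeholders.
SCORE_FOR_VERDICT = {
    "accept": 0.5,
    "minor_revision": 0.0,
    "major_revision_writing": -0.5,
    "major_revision_science": -0.5,
}


@dataclass
class Review:
    """Stand-in for the validated review model (the real one uses pydantic)."""

    verdict: str
    score: float

    def __post_init__(self) -> None:
        expected = SCORE_FOR_VERDICT[self.verdict]
        if self.score != expected:
            raise ValueError(
                f"score {self.score} does not match verdict {self.verdict!r}"
            )


def normalize_score(payload: dict) -> dict:
    """Overwrite the score with the value implied by the verdict, before validation."""
    fixed = dict(payload)  # leave the caller's dict untouched
    if fixed.get("verdict") in SCORE_FOR_VERDICT:
        fixed["score"] = SCORE_FOR_VERDICT[fixed["verdict"]]
    return fixed


raw = {"verdict": "accept", "score": 0.0}  # the kind of slip described in the Summary
review = Review(**normalize_score(raw))
```

Since the score is derived entirely from the verdict, overwriting it loses no information, and the review text survives validation.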

🤖 Generated with Claude Code

…; normalize score

Reviewers were issuing "no LaTeX source" / "no bibliography" verdicts on
arXiv-intake papers because they literally never saw the paper content:

  * _concat_tex sorted .tex files alphabetically with a 60KB budget. For
    a typical arXiv tarball (extra_pkgs.tex ≈ 3KB sorts first; main.tex
    ≈ 250KB sorts later), the budget got consumed by package
    declarations and main.tex was always skipped. The reviewer's prompt
    contained 3KB of \usepackage lines and a "(truncated; remaining
    files: 2)" footer — no abstract, no methods, no results.

  * state/citations/<PROJ>.yaml is never populated for arXiv-intake
    papers, so the bibliography section was always "(no citations
    recorded)" — even when paper/source/ref.bib was right there with
    100+ entries.

  * One specialist per project (~1/13) failed pydantic validation
    because the LLM picked "accept" verdict but wrote score=0.0 (or
    "minor_revision" with score=0.5). The score is purely derived from
    the verdict — normalize on parse instead of losing a substantive
    review to a numeric formatting slip.

Fixes:
  1. _concat_tex now promotes the entry-point file (containing
     \documentclass) to the front of the ordering, truncates IT to fit
     budget if necessary (vs. silently skipping it), and the default
     budget grew from 60KB → 180KB (~45K tokens, leaves room for the
     response in a 128K context).
  2. _summarize_bibfile fallback: when state/citations is empty, inline
     paper/source/*.bib (capped at 30KB) so the reviewer can see what's
     cited and judge the reference set.
  3. handle_response normalizes score from verdict before validation.

Verified against 8 previously-failing projects (PROJ-564, 565, 566,
568, 570, 571, 576, 578). All 8 now produce substantive 13-specialist
reviews instead of crashing or emitting boilerplate "no source provided"
verdicts. Aggregate verdicts:
  * accept           : PROJ-564, 565, 566, 576
  * minor_revision   : PROJ-568, 570, 571
  * major_revision_sci: PROJ-578 (correctly flagged "GPT-5.4 /
    Claude Sonnet 4.5 / Gemini-3.1-Pro" as unverifiable model names)

Reviews now reference specific Algorithms, Tables, Figures, and
hyperparameters by name — the LLM is reading and reasoning about the
actual paper, not the package preamble.

Adds 9 new unit tests (17 total in test_paper_reviewer_arxiv_intake).
Full unit suite (395 tests) passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jeremymanning merged commit b57889b into main on May 17, 2026
5 of 6 checks passed
jeremymanning deleted the fix/paper-reviewer-prompt-truncation branch on May 17, 2026 16:27
jeremymanning added a commit that referenced this pull request on May 17, 2026:
Picks up the 8 previously-failing arxiv-intake papers (PROJ-564, 565,
566, 568, 570, 571, 576, 578) — all now have substantive 13-specialist
reviews after PR #197 fixed the LaTeX-prompt truncation bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>